
fix: hostname port issue #4816

Merged
merged 4 commits into from
Feb 2, 2023

Conversation

LanceEa
Contributor

@LanceEa LanceEa commented Jan 26, 2023

Description

Note - this PR addresses the broken behavior; it does not address the developer experience (DX) around Host.hostname and ports. Improving the DX will take more thought and potentially breaking changes, so the goal of this PR is to restore the behavior users had in v1.Y and to fix the broken Envoy configuration.

When trying to expose a Host with a hostname that includes a non-standard port, such as 'example.com:8500', over a TLS connection, a client would receive 404s with the NR (no route) flag because Envoy could not find a valid route.

For example:

apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: basic-host
  namespace: default
spec:
  hostname: 'example.com:8500'
  tlsSecret:
    name: tls-cert
  tlsContext: 
    name: my-tls-context
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: quote-backend
  namespace: default
spec:
  hostname: "example.com:8500"
  prefix: /backend/
  service: quote

In v1.14, Emissary would group everything under a single FilterChain with a wildcard virtual_host ("*") and then do :authority matching using header matches on each Route. This allowed the above configuration to work, but had the downside of large memory usage and slower route matching, because all routes were lumped into a single virtual host.
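The cost of the v1.14-style approach can be sketched as follows. This is a hypothetical illustration (not the actual Emissary code): with a single wildcard virtual host, every route carries its own :authority header match, so route selection is a linear scan over all routes in the mesh.

```python
# Hypothetical sketch of v1.14-style routing: one wildcard virtual host,
# routes distinguished only by an :authority header match. Not the actual
# emissary/Envoy code -- just an illustration of why it scales poorly.

def pick_route(routes, authority, path):
    """Linear scan over ALL routes -- the source of the memory/latency cost."""
    for route in routes:
        if path.startswith(route["prefix"]) and route["authority"] == authority:
            return route
    return None

routes = [
    {"prefix": "/backend/", "authority": "example.com:8500", "service": "quote"},
    {"prefix": "/other/", "authority": "other.com", "service": "other"},
]

assert pick_route(routes, "example.com:8500", "/backend/api")["service"] == "quote"
assert pick_route(routes, "example.com", "/backend/api") is None
```

Because the :authority header carries the port when one is present, this scheme matched `example.com:8500` fine; the per-Host FilterChain design below is what broke it.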

In v2.Y/v3.Y, this was addressed by creating separate FilterChains for each Host. Non-TLS Hosts get a single FilterChain containing multiple virtual_hosts (one per Host). A Host with TLS produces a 1-1 FilterChain and virtual_host. This works fine in most cases where downstream clients connect over standard ports (80/443), because the DNS portion of the :authority header matches virtual_host.domain. But when a client needs to connect on something like the example above, this effectively generates a FilterChain that can never match an incoming request, due to the overloaded way Host.hostname is used.

Before the fix (pseudo Envoy config):

FilterChains:
 - FilterChainMatching:
    server_names: ["example.com:8500"]
    transport_protocol: tls
   FilterChain:
    virtual_hosts:
      - domains: ["example.com:8500"]
        routes:
          - prefix: /backend/
            headers:
              - name: ":authority"
                exact_match: example.com:8500

After the fix:

In the non-TLS scenario there are no changes to what gets generated: as in v1.Y, we have no SNI to differentiate FilterChains, so a virtual_host per Host is generated along with route header matching.

In the TLS scenario we now parse the hostname to determine the SNI. SNI does not understand ports, which is why the configuration above was effectively dead: no connections would ever be made to that FilterChain.

So, example.com:8500 would become:

  • sni=example.com
  • hostname=example.com:8500
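This split mirrors the `rsplit(":", 1)` approach visible in the diff later in this PR; a minimal sketch (hypothetical helper name):

```python
# Sketch of the hostname split described above: the SNI side drops the
# port (SNI has no notion of ports), while the virtual_host domain keeps
# the full value for :authority matching. Mirrors the rsplit(":", 1)
# pattern from the PR diff; the function name here is hypothetical.

def split_hostname(hostname: str) -> tuple[str, str]:
    sni = hostname.rsplit(":", 1)[0] if hostname else "*"
    return sni, hostname

assert split_hostname("example.com:8500") == ("example.com", "example.com:8500")
assert split_hostname("example.com") == ("example.com", "example.com")
```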

We now create a FilterChain per SNI value and add a virtual host for the full hostname, which produces:

FilterChains:
 - FilterChainMatching:
    server_names: ["example.com"]
    transport_protocol: tls
   FilterChain:
    virtual_hosts:
      - domains: ["example.com:8500"]
        routes:
          - prefix: /backend/
            headers:
              - name: ":authority"
                exact_match: example.com:8500

If a second Host with hostname=example.com is provided, then we have a conflict with the 1 FilterChain to 1 virtual_host design used in the TLS scenario. Therefore, we attempt to merge these into a single FilterChain with multiple virtual_hosts. We can only do this merge when the same TLSContext is used: otherwise we would not know which TLS settings to apply, because those are set at the FilterChain level, not the virtual_host level. This also means that Host.tls is not supported in this scenario, because internally it is translated to an implicit TLSContext and would have the same issue of two Hosts being in conflict.

apiVersion: getambassador.io/v3alpha1
kind: TLSContext
metadata:
  name: my-tls-context
  namespace: default
spec:
  secret: tls-cert
  hosts: ["*.local", "*.local:8500"]
---
apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: basic-host-2
  namespace: default
spec:
  hostname: 'example.com'
  tlsSecret:
    name: tls-cert
  tlsContext: 
    name: my-tls-context
---
apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: basic-host
  namespace: default
spec:
  hostname: 'example.com:8500'
  tlsSecret:
    name: tls-cert
  tlsContext: 
    name: my-tls-context
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: quote-backend
  namespace: default
spec:
  hostname: "example.com:8500"
  prefix: /backend/
  service: quote
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: quote-backend-2
  namespace: default
spec:
  hostname: "example.com"
  prefix: /backend2/
  service: quote

In that scenario we would produce the following:

FilterChains:
 - FilterChainMatching:
    server_names: ["example.com"]
    transport_protocol: tls
   FilterChain:
    virtual_hosts:
      - domains: ["example.com:8500"]
        routes:
          - prefix: /backend/
            headers:
              - name: ":authority"
                exact_match: example.com:8500
      - domains: ["example.com"]
        routes:
          - prefix: /backend2/
            headers:
              - name: ":authority"
                exact_match: example.com

Related Issues

I might be missing some but these capture it mostly.

Testing

CI is green here, and the change was pulled into Edge Stack as well, as seen here: https://github.com/datawire/apro/pull/3228

I have added some unit tests similar to the golden tests we had a while back. They allow a much faster dev loop, but they may suffer from the same challenge we had before: any change to the default generated config will require them to be updated. I have left them in place for now, but we may want to revisit this if it becomes a pain.

Checklist

  • Does my change need to be backported to a previous release?
  • I made sure to update CHANGELOG.md.
  • This is unlikely to impact how Ambassador performs at scale.
  • My change is adequately tested.
  • I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.
  • The changes in this PR have been reviewed for security concerns and adherence to security best practices.

When unit testing diagd, most of the tests use the `econf_compile` helper,
which by default asserts that no aconf errors existed. This is good,
but it requires attaching a debugger to grab the errors.

This prints out the object so that when the test is run outside of
an IDE with a debugger, the errors can be seen. Although it's not neatly
diffed/printed due to being a nested structure. There is an issue to
improve this, but it looks to still be open: pytest-dev/pytest#1531.
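The idea in this commit can be sketched roughly as follows. This is a hedged illustration: `assert_no_errors` and its `errors` dict are stand-ins for the real aconf error structure checked inside emissary's `econf_compile` helper, not its actual API.

```python
# Hedged sketch of the commit's idea: print the collected errors before
# asserting, so a plain pytest run shows them without a debugger.
# `assert_no_errors` / `errors` are hypothetical stand-ins for the real
# aconf error structure used by econf_compile.
import pprint

def assert_no_errors(errors: dict) -> None:
    if errors:
        # Nested structure, so the output is not neatly diffed -- but at
        # least it is visible in the test log.
        pprint.pprint(errors)
    assert not errors, f"unexpected aconf errors: {list(errors)}"

assert_no_errors({})  # no errors: passes silently
```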

Signed-off-by: Lance Austin <laustin@datawire.io>
@LanceEa LanceEa force-pushed the laustin/fix-hostname-port-issue branch 3 times, most recently from 79c3b19 to 692b558 Compare January 27, 2023 12:35
When trying to expose a Host with a hostname that includes a port, such
as 'example.com:8500', clients receive 404 NRs due to the way Envoy was
configured.

In v1.14, we used to group everything under a wildcard virtual_host
domain "*" and did not include sni matching  on the Filter Chain. This
allowed that configuration to work but had the downside of causing
large memory usage and slower route matching due to lumping all
routes into a single virtual host.

In the v2.Y and v3.Y series, this was addressed by creating separate
Filter Chains for each host. A non-tls Host would get a single Filter
Chain with multiple virtual_hosts per Host. A Host with TLS would
produce a 1-1 FilterChain and Virtualhost. This works fine in most
cases when downstream clients are connecting on standard ports
(80/443), but when a client needs to connect on something like
example.com:8500 this would effectively generate a Filter Chain
that could never match an incoming request.

In the non-tls scenario there are no changes to what gets generated. In
the TLS scenario we now parse the hostname into two entities: the SNI and
the virtual_host.domain. So, example.com:8500 would have an SNI of
example.com and a virtual_host.domain of example.com:8500.
We create a Filter Chain for the sni value and add a virtual host for the
domain value. If a second Host with just `example.com` is provided as
well then we will attempt to merge these into a single Filter Chain with
multiple virtual_host. We can only do this when the same TLSContext
is used because we wouldn't know which attributes to take from which
host since all the transport_socket settings are at the Filter Chain level.

This restores existing behavior in a backwards-compatible way and
doesn't try to solve the Developer Experience (DX) issues with the way
the Host is currently designed.

Signed-off-by: Lance Austin <laustin@datawire.io>
@LanceEa LanceEa force-pushed the laustin/fix-hostname-port-issue branch from 692b558 to 4847a15 Compare January 27, 2023 13:34
@LanceEa LanceEa marked this pull request as ready for review January 27, 2023 17:10
Alice-Lilith
Alice-Lilith previously approved these changes Jan 30, 2023
Member

@Alice-Lilith Alice-Lilith left a comment


Question about the changelog but otherwise no concerns

docs/releaseNotes.yml
@@ -391,17 +394,17 @@ def save_context(self, ir: "IR", ctx_name: str, tls_ss: SavedSecret, tls_name: s
# TLS config is good, let's make sure the hosts line up too.
context_hosts = ctx.get("hosts")

# XXX WTF? self.name is not OK as a hostname! Leaving this for the moment, but it's
# almost certainly getting shredded before 2.0 GAs.
host_hosts = [self.name]

What are the ramifications of getting rid of name? Basically, why can we do it now when it wasn't done before?

Contributor Author

As we discussed offline, I dug into this deeper. Although we don't document this anywhere, there is a potential that a user could ignore our docs and interpret TLSContext.hosts as needing to match Host.name.

Although it is unlikely and not our intended behavior, I will add it back since it is not crucial to the fix in this PR. I will capture it along with the other deprecated TLS features I outlined, and we can plan for its removal in the future.

Contributor Author

Here is the overall issue on this topic:
#4826

I mention this there and note that it is deprecated. I will have a fixup commit here shortly that removes some dead code where we check self.hostname or self.name, because we default self.hostname to "*" if the user doesn't provide it.

Member

FTR I'm the one who wrote that comment. self.name is not guaranteed to be something that's actually sane for a DNS match, IIRC, so I'm in favor of it going away. However, it used to be important that host_hosts never be empty; might be worth a test.

Member

(Newer folks, if you see a comment that starts with XXX, it's almost certainly something I wrote. 🙂)

other = next(iter(self.hosts.values()))
other_type = "TCPMapping" if isinstance(other, IRTCPMappingGroup) else "Host"
tcpmapping.post_error(
"TCPMapping {tcpmapping.name}: discarding because it conflicts with {other_type} {other.name}"
f"TCPMapping {tcpmapping.name}: discarding because it conflicts with {other_type} {other.name}"

So we're not going to allow a Host resource with hostname example.com:5000 and a TCPMapping with host example.com, regardless of the port specified on the TCPMapping?

What if you specified a different port on the TCPMapping? Theoretically you should be able to put them on different filter chains, with filter chain matching based on SNI & port.

Or is that case handled elsewhere? I believe a new listener is created if none exists with the specified TCPMapping port.

Contributor Author

So we're not going to allow a Host resource with hostname example.com:5000 and a TCPMapping with host example.com, regardless of the port specified on the TCPMapping?

Correct. In v3listeners.finalize we process TCPMappings first, and their filter chains get precedence on a Listener. Luke left a comment in that function that outlines it well; the reasoning for this has to do with providing sensible upgrade paths.

Since TCPMapping gets precedence, in the scenario you outline they are both trying to generate a FilterChain with the same FilterChainMatch (TLS & SNI example.com), which Envoy doesn't allow. Technically, you could differentiate these filter chains by adding alpn application_protocol: [h2, http/1.1] and Envoy would accept it, but Envoy itself calls out the fact that ALPN is not reliable, so I think this would cause more headaches than benefits.

What if you specified a different port on the TCPMapping? Theoretically you should be able to put them on different filter chains, with filter chain matching based on SNI & port.

Or is that case handled elsewhere? I believe a new listener is created if none exists with the specified TCPMapping port.

Correct, that would work, because the TCPMapping would attach to an existing Listener or implicitly create one behind the scenes for that address/port, so there would be no overlap.

Couple of things

  1. I think the function you commented on can be confusing, because it relies on knowing that TCPMappings take precedence in V3Listeners.finalize, and on the method chaining where V3Listener.add_chain is shared by TCP/HTTP chains and returns the V3Chain. So mentally you have to switch context depending on which code path led you here.

  2. While I was looking into this I found a bug when applying a non-tls cleartext TCPMapping to a listener that serves HTTP. I have captured it here: tcpmappings: invalid envoy configuration generated when applying non-tls TCPMapping on listener with HTTP #4825

@@ -62,7 +64,8 @@ def __init__(self, config: "V3Config", context: Optional["IRTLSContext"]) -> Non

self.context = context

This is honestly just a nit, but I keep getting confused about what context stands for and have to constantly look it up to see that it refers to a TLSContext 😆

Contributor Author

Agree... but it's fairly pervasive and would be out of scope for this PR :).

Member

a. @haq204, you're not wrong. 😐
b. A bunch of this is stuff I wrote before TLSContext existed as a named thing.
c. Sorry. 😇

self.add_chain("tcp", context, group_host).add_tcphost(irgroup)

sni = group_host.rsplit(":", 1)[0] if group_host else "*"
self.add_chain("tcp", context, group_host, group_host).add_tcphost(irgroup)

I think you meant to pass in the sni variable?

Contributor Author

yup good catch

Contributor Author

I'm actually going to push a fixup commit for this. I had originally done this for symmetry between HTTP and TCP, but TCPMapping already has host and port. The port is required, while the host is optional if the user wants to do SNI matching against a TLSContext. Therefore, for this PR I think it is out of scope and we can leave it. But it might be good to provide earlier TCPMapping validation on the host to ensure the user hasn't included a port there. That, though, is a new feature and could be done in a follow-up PR.

Signed-off-by: Lance Austin <laustin@datawire.io>
Signed-off-by: Lance Austin <laustin@datawire.io>
@LanceEa LanceEa merged commit f70b5ae into master Feb 2, 2023
@LanceEa LanceEa deleted the laustin/fix-hostname-port-issue branch February 2, 2023 18:46